512 research outputs found

    Language Model Combination and Adaptation Using Weighted Finite State Transducers

    In speech recognition systems, language models (LMs) are often constructed by training and combining multiple n-gram models. These can be used either to represent different genres or tasks found in diverse text sources, or to capture the stochastic properties of different linguistic symbol sequences, for example syllables and words. Unsupervised LM adaptation may also be used to further improve robustness to varying styles or tasks. When using these techniques, extensive software changes are often required. In this paper an alternative and more general approach based on weighted finite state transducers (WFSTs) is investigated for LM combination and adaptation. As it is entirely based on well-defined WFST operations, minimal changes to decoding tools are needed. A wide range of LM combination configurations can be flexibly supported. An efficient on-the-fly WFST decoding algorithm is also proposed. Significant error rate gains of 7.3% relative were obtained on a state-of-the-art broadcast audio recognition task using a history-dependently adapted multi-level LM modelling both syllable and word sequences.
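
    The abstract does not give implementation details, but the operations it relies on (weighted union of LM acceptors, composition with the decoding input, shortest-path search) are standard WFST primitives. Below is a minimal Python sketch using the pynini OpenFst bindings; the toy strings and weights are illustrative assumptions, not the paper's multi-level syllable/word LMs.

        # Minimal WFST sketch (not the paper's system): combine two toy word-level
        # "LM" acceptors by weighted union, then apply the result to an input
        # string via composition and shortest-path search.
        import pynini

        # Toy acceptors over byte strings; the weight stands in for an LM score
        # (tropical semiring: lower weight = more likely).
        lm_a = pynini.accep("the cat sat", weight=1.0)
        lm_b = pynini.accep("the cat sang", weight=2.0)

        # Weighted union plays the role of LM combination; optimize() determinizes
        # and minimizes where possible.
        combined = pynini.union(lm_a, lm_b).optimize()

        # "Decoding": compose an input hypothesis with the combined LM and read
        # off the best path.
        hyp = pynini.accep("the cat sat")
        best = pynini.shortestpath(pynini.compose(hyp, combined))
        print(best.string())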

    Adapting an Unadaptable ASR System

    As speech recognition model sizes and training data requirements grow, it is increasingly common for systems to be available only via APIs from online service providers, rather than with direct access to the models themselves. In this scenario it is challenging to adapt systems to a specific target domain. To address this problem we consider the recently released OpenAI Whisper ASR as an example of a large-scale ASR system to assess adaptation methods. An error-correction-based approach is adopted, as this does not require access to the model, but can be trained from either the 1-best or N-best outputs that are normally available via the ASR API. LibriSpeech is used as the primary target domain for adaptation. The generalization ability of the system is then evaluated in two distinct dimensions: first, whether the form of correction model is portable to other speech recognition domains, and secondly, whether it can be used for ASR models having a different architecture. Comment: submitted to INTERSPEECH
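
    As a rough illustration of the error-correction idea (training a separate model on the hypotheses an ASR API returns, without touching the ASR model itself), the following sketch runs a seq2seq model over a 1-best hypothesis. The checkpoint name, the "correct: " prefix and the example hypothesis are placeholders; the paper fine-tunes its own correction model on Whisper outputs.

        # Hedged sketch: apply a seq2seq "error correction" model to an ASR
        # 1-best hypothesis obtained from an API. Checkpoint and prefix are
        # placeholders, not the paper's setup.
        from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

        MODEL = "t5-small"  # placeholder; a fine-tuned correction checkpoint in practice
        tok = AutoTokenizer.from_pretrained(MODEL)
        model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

        asr_1best = "i red the book on the plain"   # hypothetical ASR API output
        inputs = tok("correct: " + asr_1best, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=64)
        print(tok.decode(out[0], skip_special_tokens=True))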

    Adapting an ASR Foundation Model for Spoken Language Assessment

    A crucial part of an accurate and reliable spoken language assessment system is the underlying ASR model. Recently, large-scale pre-trained ASR foundation models such as Whisper have been made available. As the output of these models is designed to be human readable, punctuation is added, numbers are presented in Arabic numeral form and abbreviations are included. Additionally, these models have a tendency to skip disfluencies and hesitations in the output. Though useful for readability, these attributes are not helpful for assessing the ability of a candidate and providing feedback; here a precise transcription of what a candidate said is needed. In this paper, we give a detailed analysis of Whisper outputs and propose two solutions: fine-tuning and soft prompt tuning. Experiments are conducted on both public speech corpora and an English learner dataset. Results show that we can effectively alter the decoding behaviour of Whisper to generate the exact words spoken in the response. Comment: Proceedings of SLaTE
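
    Of the two proposed solutions, fine-tuning is the more direct to sketch. The snippet below shows one illustrative gradient step for a Hugging Face Whisper model against a verbatim reference that keeps hesitations; the model size, dummy audio and reference text are assumptions, and soft prompt tuning (the second solution) is not shown.

        # Hedged fine-tuning sketch: one gradient step of Whisper against a
        # verbatim reference that retains disfluencies ("um", "uh").
        import torch
        from transformers import WhisperProcessor, WhisperForConditionalGeneration

        processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
        model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

        audio = torch.zeros(16000)  # 1 s of silence as a stand-in for real audio
        features = processor(audio.numpy(), sampling_rate=16000,
                             return_tensors="pt").input_features
        labels = processor.tokenizer("um i uh went to the store",
                                     return_tensors="pt").input_ids

        model.train()
        loss = model(input_features=features, labels=labels).loss
        loss.backward()  # real training would use an optimizer / Trainer loop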

    N-best T5: Robust ASR Error Correction using Multiple Input Hypotheses and Constrained Decoding Space

    Error correction models form an important part of Automatic Speech Recognition (ASR) post-processing, improving the readability and quality of transcriptions. Most prior work uses the 1-best ASR hypothesis as input and can therefore only perform correction by leveraging the context within one sentence. In this work, we propose a novel N-best T5 model for this task, which is fine-tuned from a T5 model and uses ASR N-best lists as model input. By transferring knowledge from the pre-trained language model and obtaining richer information from the ASR decoding space, the proposed approach outperforms a strong Conformer-Transducer baseline. Another issue with standard error correction is that the generation process is not well guided. To address this, a constrained decoding process, based either on the N-best list or on an ASR lattice, is used, which allows additional information to be propagated. Comment: submitted to INTERSPEECH
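
    The constrained-decoding idea can be approximated with the generic prefix_allowed_tokens_fn hook in Hugging Face generate(), restricting output tokens to those occurring in the N-best list. The sketch below uses an off-the-shelf t5-small checkpoint and a made-up N-best list; the paper's N-best/lattice-constrained decoding is more elaborate.

        # Hedged sketch: constrain a seq2seq correction model's decoding space
        # to tokens that appear in the ASR N-best list.
        from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

        tok = AutoTokenizer.from_pretrained("t5-small")          # placeholder checkpoint
        model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

        nbest = ["i read the book on the plane",
                 "i red the book on the plain",
                 "i read the book on the plain"]
        allowed = set(tid for hyp in nbest for tid in tok(hyp).input_ids)
        allowed |= set(tok.all_special_ids)                      # let decoding terminate

        inputs = tok(" [SEP] ".join(nbest), return_tensors="pt") # N-best list as one input
        out = model.generate(
            **inputs,
            max_new_tokens=32,
            prefix_allowed_tokens_fn=lambda batch_id, prefix: sorted(allowed),
        )
        print(tok.decode(out[0], skip_special_tokens=True))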

    Successive loadings of reactant in the hydrogen generation by hydrolysis of sodium borohydride in batch reactors

    In this paper, for the first time, an experimental investigation is presented of five successive loadings of a reactant alkaline solution of sodium borohydride (NaBH4) for hydrogen generation, using an improved nickel-based powder catalyst, under uncontrolled ambient conditions. The experiments were performed in two batch reactors with internal volumes of 0.646 L and 0.369 L. The compressed hydrogen generated, at pressures below the hydrogen critical pressure, emphasizes the importance of considering solubility effects during the reaction, which lead to storage of hydrogen in the liquid phase inside the reactor. The present work suggests that the sodium metaborate by-product formed by the alkaline hydrolysis of NaBH4, in a closed pressure vessel without temperature control, is NaBO2·xH2O, with x ≥ 2. The data obtained in this work lend credence to x ≈ 2, as discussed on the basis of the XRD results, and this calls for increased caution in the definition of the hydrolysis reaction of NaBH4 up to temperatures of 333 K and pressures of 0.13 MPa.
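
    For reference, the idealized hydrolysis reaction and the hydrated-metaborate form discussed above can be written as follows; the second equation, with x ≥ 2 (and the data here pointing to x ≈ 2), is the form the abstract argues for.

        % Idealized NaBH4 hydrolysis and the hydrated by-product form
        \mathrm{NaBH_4} + 2\,\mathrm{H_2O} \longrightarrow \mathrm{NaBO_2} + 4\,\mathrm{H_2}
        \mathrm{NaBH_4} + (2 + x)\,\mathrm{H_2O} \longrightarrow \mathrm{NaBO_2}\cdot x\,\mathrm{H_2O} + 4\,\mathrm{H_2}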

    Blind Normalization of Speech From Different Channels

    We show how to construct a channel-independent representation of speech that has propagated through a noisy, reverberant channel. This is done by blindly rescaling the cepstral time series with a non-linear function, the form of this scale function being determined by previously encountered cepstra from that channel. The rescaled form of the time series is an invariant property of it in the following sense: it is unaffected if the time series is transformed by any time-independent invertible distortion. Because a linear channel with stationary noise and impulse response transforms cepstra in this way, the new technique can be used to remove the channel dependence of a cepstral time series. In experiments, the method achieved greater channel independence than cepstral mean normalization, and it was comparable to the combination of cepstral mean normalization and spectral subtraction, despite the fact that no measurements of channel noise or reverberation were required (unlike spectral subtraction). Comment: 25 pages, 7 figures
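
    One simple instance of such a blind, non-linear scale function is per-dimension histogram equalization: each cepstral coefficient is mapped through the empirical CDF estimated from previously seen frames of the same channel, which is unaffected by any monotonic per-dimension invertible distortion. The sketch below illustrates this idea only; it is not necessarily the exact rescaling used in the paper.

        # Illustrative sketch: rescale each cepstral dimension through the
        # empirical CDF of previously encountered cepstra from the same channel.
        import numpy as np

        def blind_rescale(cepstra, reference):
            """cepstra, reference: arrays of shape (frames, dims) from one channel."""
            out = np.empty_like(cepstra, dtype=float)
            for d in range(cepstra.shape[1]):
                ref = np.sort(reference[:, d])
                # empirical CDF value of each coefficient, in (0, 1)
                ranks = np.searchsorted(ref, cepstra[:, d], side="right")
                out[:, d] = ranks / (len(ref) + 1)
            return out

        channel_history = np.random.randn(1000, 13)           # previously seen cepstra
        new_utterance = 2.0 * np.random.randn(200, 13) + 1.0  # same source, distorted
        print(blind_rescale(new_utterance, channel_history).shape)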

    Zero-shot Audio Topic Reranking using Large Language Models

    The Multimodal Video Search by Examples (MVSE) project investigates using video clips, rather than the more traditional text query, as the query term for information retrieval. This enables far richer search modalities such as images, speaker, content, topic, and emotion. A key element of this process is highly rapid, flexible search over large archives, which in MVSE is facilitated by representing video attributes with embeddings. This work aims to mitigate any performance loss from this rapid archive search by examining reranking approaches. In particular, zero-shot reranking methods using large language models are investigated, as these are applicable to any video archive audio content. Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus. Results demonstrate that reranking can achieve improved retrieval ranking without the need for any task-specific training data.
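
    A zero-shot LLM reranker can be as simple as prompting the model to rate each retrieved segment's transcript against the topic query and sorting by the returned score. The sketch below uses the OpenAI chat API with a placeholder model name and prompt; it illustrates the general recipe, not the paper's exact configuration or archive.

        # Hedged sketch of zero-shot reranking: score transcript relevance with
        # an LLM, then sort the candidates by that score.
        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        def llm_relevance(query: str, transcript: str) -> float:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{
                    "role": "user",
                    "content": f"On a scale of 0-10, how relevant is this transcript "
                               f"to the topic '{query}'? Answer with a single number."
                               f"\n\n{transcript}",
                }],
            )
            try:
                return float(resp.choices[0].message.content.strip().split()[0])
            except ValueError:
                return 0.0

        candidates = ["...transcript of clip A...", "...transcript of clip B..."]
        reranked = sorted(candidates,
                          key=lambda t: llm_relevance("climate change", t),
                          reverse=True)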

    Improving Multiple-Crowd-Sourced Transcriptions Using a Speech Recogniser

    This paper introduces a method to produce high-quality transcriptions of speech data from only two crowd-sourced transcriptions. These transcriptions, produced cheaply by people on the Internet, for example through Amazon Mechanical Turk, are often of low quality. Often, multiple crowd-sourced transcriptions are combined to form one transcription of higher quality. However, the state of the art essentially uses a form of majority voting, which requires at least three transcriptions for each utterance. This paper shows how to refine this approach to work with only two transcriptions. It then introduces a method that uses a speech recogniser (bootstrapped on a simple combination scheme) to combine transcriptions. When only two crowd-sourced transcriptions are available, on a noisy data set this improves the word error rate against gold-standard transcriptions by 21% relative.
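
    The recogniser's role as a tie-breaker between two noisy transcriptions can be illustrated with a simple word-level alignment: keep regions where both workers agree, and where they disagree, pick the version closer to the recogniser's hypothesis, as in the sketch below. The paper's combination scheme (with the recogniser bootstrapped on a simple combination) is more sophisticated; this only shows the basic idea.

        # Simplified sketch: combine two crowd-sourced transcriptions, using an
        # ASR hypothesis to resolve the regions where they disagree.
        import difflib

        def combine(trans_a, trans_b, asr_hyp):
            a, b, asr = trans_a.split(), trans_b.split(), asr_hyp.split()
            out = []
            for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
                if op == "equal":
                    out.extend(a[i1:i2])          # both workers agree
                else:
                    region_a, region_b = a[i1:i2], b[j1:j2]
                    # prefer the disputed region that better matches the recogniser
                    score_a = difflib.SequenceMatcher(a=region_a, b=asr).ratio()
                    score_b = difflib.SequenceMatcher(a=region_b, b=asr).ratio()
                    out.extend(region_a if score_a >= score_b else region_b)
            return " ".join(out)

        print(combine("the quick brown fox", "the quick brown box",
                      "a quick brown fox jumps"))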

    Antimicrobial activity of a library of thioxanthones and their potential as efflux pump inhibitors

    The overexpression of efflux pumps is one of the causes of multidrug resistance, which leads to the inefficacy of drugs. This plays a pivotal role in antimicrobial resistance, and the most notable pumps are the AcrAB-TolC system (AcrB belongs to the resistance-nodulation-division family) and NorA, from the major facilitator superfamily. In bacteria, these structures can also favor virulence and adaptation mechanisms, such as quorum sensing and biofilm formation. In this study, the design and synthesis of a library of thioxanthones as potential efflux pump inhibitors are described. The thioxanthone derivatives were investigated for their antibacterial activity and their inhibition of efflux pumps, biofilm formation, and quorum sensing. The compounds were also studied for their potential to interact with P-glycoprotein (P-gp, ABCB1), an efflux pump present in mammalian cells, and for their cytotoxicity in both mouse fibroblasts and human Caco-2 cells. The results concerning real-time ethidium bromide accumulation may suggest a potential bacterial efflux pump inhibition, which has not yet been reported for thioxanthones. Moreover, in vitro studies in human cells demonstrated a lack of cytotoxicity for concentrations up to 20 µM in Caco-2 cells, with some derivatives also showing potential for P-gp modulation. This research was supported by national funds through FCT (Foundation for Science and Technology) within the scope of UIDB/04423/2020 and UIDP/04423/2020 (Group of Natural Products and Medicinal Chemistry-CIIMAR), and under the project PTDC/SAU-PUB/28736/2017 (reference POCI-01–0145-FEDER-028736), co-financed by COMPETE 2020, Portugal 2020 and the European Union through the ERDF, and by FCT through national funds and the structured R&D&I programme ATLANTIDA (NORTE-01-0145-FEDER-000040), supported by NORTE2020 through ERDF, and CHIRALBIO ACTIVE-PI-3RL-IINFACTS-2019.